Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run tests with miri #899

Closed
wants to merge 12 commits into from
Closed

Run tests with miri #899

wants to merge 12 commits into from

Conversation

swfsql
Copy link
Contributor

@swfsql swfsql commented Dec 5, 2023


  • Temporary patch on crossbeam.
    • Note that crossbeam is an indirect dependency and they are still to release a version that passes on this miri check. So to temporarily get this out of the way, you can patch crossbeam on the workspace dfdx/Cargo.toml.
    • Wait for crossbeam to make a new release and patch to that.
    • Then try to update dependencies so the patch is no longer needed.
  • Check all test pass by some feature combinations.
    • std,fast-alloc,cpu,nightly,safetensors
    • std,cpu,nightly,safetensors
    • no-std,fast-alloc,nightly
    • no-std,nightly
  • Reduce test sizes so a miri pass is practical

To run a particular test, add it's rust path as a filter at the end of the miri run command:

# cargo clean (run once when `cargo miri` is not used)
cargo clean

# run all tests (note: very slow and should take hours, assuming there are no errors)
MIRIFLAGS='-Zmiri-strict-provenance -Zmiri-symbolic-alignment-check -Zmiri-retag-fields -Zmiri-disable-isolation -Zmiri-tree-borrows' \
cargo miri nextest run --test-threads 1 --no-default-features --features std,fast-alloc,cpu,nightly,safetensors

Miri UB errors

  • None found, both under dfdx package (74 tests) and under dfdx-core(293 tests).

Memory/Thread leaks

  • nn::layers::add_into::tests::longer_network
    • Thread-leak error, disabling rayon on gemm for matmul seems to fix it. Commit: c6e0642
    • Investigate why the leak happened, try to re-enable rayon usage.
  • tensor::cpu::tests::test_reuse_allocation_on_clone_tensor
    • Memory-leak error, deallocating the cached allocations on TensorCache drop seems to fix it. Commit: ae777d4
    • Avoid panics during drops? Instead of panicking, can emit error message warning about possible memory leaks.
  • None other found, both under dfdx package (74 tests) and under dfdx-core(293 tests).

Smaller tensors

Some tests have been reduced in tensor size so miri runs faster. Those changes are specific to the following commits:

The tensor reduction is quite optional, they only makes it simpler to run miri "on everything". But if this change ends up persisting, this wiki page should be updated. Finally, Idk how mha/transformer work, so I can't tell if the reduction shouldn't happen. That is, each test reduction should be reviewed to ack their reduction doesn't lose anything that should be tested.


Note that Miri depends on the test suite to find something, so having a bigger test suite may help. This is to say that a Miri check could be re-run in the future.
I haven't verified the cpu test coverage over the cpu source code, but a good way to go would be to have a high coverage (every relevant line of code getting "tested"). If more assurance would be desired, there could be more tests aiming at different characteristics of the source code, such as those related to branch coverage etc.

swfsql added 5 commits March 1, 2024 11:15
Note that crossbeam is an indirect dependency and they are still to release a version that [passes](crossbeam-rs/crossbeam#996) on this miri check.

So to temporarily get this out of the way, you can patch crossbeam on the workspace `dfdx/Cargo.toml`.
Thread main finished before other threads were still active. The only
thing related to threads were the gemm usage of rayon, and disabling
rayon usage for matmul seems to be sufficient to fix this.
- Moved the impl of `Cache::try_empty_cache()` to `TensorCache::clear()`.
  - This can be invoked both by `Cache::try_empty_cache()` and by `drop(TensorCache)`.
- Moved the device cache ptr deallocation to `BytesPtr`, `CudaBytesPtr` (newtype over `CUdeviceptr`) and `Buffer`.
  - This is abstracted by the `CachePtr` trait.
  - Can be called by `TensorCache::clear()`.
  - This method may require some "extra" device information, such as in the cuda case. That information is held by `TensorCache`.
@swfsql
Copy link
Contributor Author

swfsql commented Mar 1, 2024

I'll be closing this as this was enough to indicate robustness on the library (at least in my opinion!), but nevertheless this can still be used as a reference whenever someone chooses to do a miri trial.

@swfsql swfsql closed this Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce test sizes Run tests with miri
2 participants